Re: Low Budget Performance - Mailing list pgsql-performance

From eric soroos
Subject Re: Low Budget Performance
Date
Msg-id 78755096.1176234283@[4.42.179.151]
Whole thread Raw
In response to Re: Low Budget Performance  ("Josh Berkus" <josh@agliodbs.com>)
Responses Re: Low Budget Performance  (Andrew Sullivan <andrew@libertyrms.info>)
List pgsql-performance
Josh,

Thanks for the reply.

> One gig is a large database for a single IDE drive -- especially with
> multiple client connections.

That's good to know.

Is a scsi system that much better? Looking at prices, scsi is 1/2 the capacity and double the price for the 80 gig
7200rpmide vs 36 gig 10k rpm scsi. Assuming that I'll never run out of space before running out of performance, I can
dedicate2x the number of ide drives to the problem.  

> > Well, IDE RAID looks like nice optio to me, but before finalising
> > RAID config.,
> > I would advice to test performance and scalability with separate
> > database
> > server and couple of Gigs of RAM.
>
> I'm not convinced that current IDE RAID actually improves database disk
> throughput -- there's a lot of overhead in the one controller I tried
> (Promise).  Does anyone have some statistics they can throw at me?

All of the benchmarks that I've seen show that IDE raid is good for large operations, but for random seek and small
datatransfers, you don't get anywhere near the expected scaling.   

> A cheaper and easier method, involving 3-4 disks:
>
> Channel 1, Disk 1:  Operating System, Swap, and PostgreSQL log
> Channel 1, Disk 2:  WAL Files
> Channel 2, Disk 1:  Database
> Channel 2, Disk 2 (optional):  2nd database data

With IDE, I think I can manage to put each drive on a seperate channel. I've either got one extra controller onboard,
orI can add a 4 channel pci card. From what I've read, this is one of the more important factors in IDE performance.  

> *however*, if you have multiple databases being simulteaneously
> accessesed, you will want to experiment with shuffling around the
> databases and WAL files to put them on different disks.  The principle
> is to divide the disk tasks that are simultaenous ammonng as many disks
> as possible; thus the WAL files always do better on a different disk
> and channel than the database.

That's what I've read about database disk system design. Reduce spindle contention by using lots of drives.
(especiallyin Philip Greenspun's book, but he's talking about 2x7 drives as a minimal configuration and 2x21 as ideal
forlarger systems. And when licensing is more expensive than that sort of drive system, it's all roundoff error.) 

So, assuming that I have three databases with roughly equal load on them, does it make sense to partition them like:

disk 0: os/swap/log/backup staging
disk 1: WAL 1, DB 2
disk 2: WAL 2, DB 3
disk 3: WAL 3, DB 1

Or, in a slightly bigger drive system split 2 ways then mirrored.
disk 0: os etc
Disk 1,2: WAL 1, DB 2
Disk 3,4: WAL 2, DB 1

From an admin point of view, would this be done with alternate locations, symlinks, or multiple concurrent pg
processes?

> > > Does the write ahead logging of PG mean that no matter what indexes
> > and data are changed, that there will be one sync to disk?  Does this
> > reduce the penalty of indexes?
>
> In a word:  No.   Depending on the size of the update, there may be
> multiple synchs.    And indexes do carry a significant penalty on large
> updates; just try runninng 10,000 updates to an indexed column as one
> transaction, and the penalty will be obvious.  In fact, for my data
> load procedures, I tend to drop and re-create indexes.

Most of my update procedures are single row updates, with the exception being things that are already background tasks
thatthe user doesn't notice the difference between 10 and 20 sec. So maybe I'm lucky there. 

> Mirrored drives are different than RAID.   However, you are correct
> that the redundancy/fail-over factor in some RAID and Mirroring comes
> at a performance penalty.

From howtos I've seen, there _can_ be a speed boost with mirroring on read using the linux kernel raid 1. Write
performancesuffers though.  

> But you need to determine where you are actually losing time.

That looks like it will get me started.

eric



pgsql-performance by date:

Previous
From: Andrew Sullivan
Date:
Subject: Re: Low Budget Performance
Next
From: Andrew Sullivan
Date:
Subject: Re: Low Budget Performance